```r
# THE USER CAN SELECT A SPECIES HERE!
# Alnus, Betula, Corylus or Poaceae (Alder, Birch, Hazel or Grasses)
species_sel <- "Poaceae"
```

In this project we want to evaluate the new pollen forecast module that uses "real-time data" for calibration. As more automatic pollen monitors are deployed, real-time calibration of pollen forecasts from weather models becomes increasingly relevant.

This study uses historical Hirst measurements and weather model reforecasts to investigate various implementations and their potential for forecast improvements.

For detailed information about the final implementation in COSMO-1E, please refer to this documentation page (ask MeteoSwiss for access): https://service.meteoswiss.ch/confluence/x/dYQYBQ

## The Data

Three datasets are available for that endeavour:

We have four species available for this analysis. The months were selected with our pollen experts and data was retrieved from the model runs.

The analysis is carried out for each species individually, but with all stations and both years combined.

The observations are provided at 14 different stations, of which one is excluded from the analysis: Davos/Wolfgang is high up in the mountains and its pollen measurements are almost always zero.

The following settings are crucial and should always be remembered when running the chunks below:

The measured and modelled values were artificially increased by 0.001 to enable taking the log and to avoid division by zero.
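This offset is a one-liner; a minimal sketch in R (the values are illustrative, not taken from the datasets):

```r
# Daily pollen concentrations contain zeros, which break log() and ratios.
obs <- c(0, 12, 4, 0, 87)   # illustrative measured daily concentrations
mod <- c(1, 10, 0, 3, 90)   # illustrative modelled daily concentrations

eps <- 0.001                # offset used throughout the analysis
obs_adj <- obs + eps
mod_adj <- mod + eps

log_ratio <- log(mod_adj / obs_adj)  # finite even where obs or mod is zero
```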

## Residual Analysis

First we compare the three datasets with the traditional ANOVA approach. Statistical inference (p-values, confidence intervals, ...) is only valid if the model assumptions are fulfilled. This means (many paragraphs are quoted from Lukas Meier, ETH - Script of the Applied Statistics ANOVA Course):

  • the errors are independent
  • the errors have mean zero
  • the error variance is constant
  • the errors are normally distributed

The first assumption is most crucial (but also most difficult to check). If the independence assumption is violated, statistical inference can be very inaccurate. In the ANOVA setting, the last assumption is typically not as important compared to a regression setting, as we are typically fitting “large” models.

### Are the errors normally distributed?

In a QQ-plot we plot the empirical quantiles ("what we see in the data") vs. the theoretical quantiles ("what we expect from the model"). The plot should show a more or less straight line if the distributional assumption is correct. By default, a standard normal distribution is the theoretical "reference distribution".

They are definitely not, and we have to make some adjustments. For the following plot we therefore log-transform the data to deal with the right-skewness. The best results were achieved by first taking the logarithm and then the square root.

### Do the errors have mean zero? & Is the error variance constant?

The Tukey-Anscombe plot plots the residuals vs. the fitted values. It allows us to check whether the residuals have constant variance and whether they have mean zero (i.e. they don't show any deterministic pattern). We don't plot the smoothing line, as loess (and other) algorithms have issues when the same value is repeated a large number of times (jitter did not really help).

### Are the errors independent?

If the data has some serial structure (i.e., if observations were recorded in a certain time order), we typically want to check whether residuals close in time are more similar than residuals far apart, as this would be a violation of the independence assumption. We can do so by using a so-called index plot where we plot the residuals against time. For positively dependent residuals we would see time periods where most residuals have the same sign, while for negatively dependent residuals, the residuals would "jump" too often from positive to negative compared to independent residuals.

Summary: The residual analysis shows that the assumptions of "normal" statistical methods are violated even for log(daily values). It is therefore suggested to continue the analysis with robust and simple metrics.

## Visual Assessment

### Basic Plots

General overview of the daily concentration values as represented in the three time series. Each plot shows one species in one year. The separate lines represent individual measurement stations.

First as a boxplot and second as a histogram of the daily differences.

### Correlation Plots

The correlation between the model and the measurements can be calculated easily; the confidence intervals and p-values must then be adjusted for multiple comparisons. The `corr.test` function from the psych package conveniently offers this functionality.
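Internally this boils down to pairwise correlations with multiplicity-adjusted p-values; a base-R sketch of the same idea on synthetic data (station names are made up; `psych::corr.test` uses the Holm adjustment by default):

```r
set.seed(1)
stations <- c("Zuerich", "Basel", "Lugano")   # hypothetical station names

# One raw p-value per station for cor(measurement, model)
p_raw <- sapply(stations, function(s) {
  meas  <- rlnorm(50)                    # synthetic measurements
  model <- meas * rlnorm(50, sd = 0.3)   # synthetic model values
  cor.test(log(meas), log(model))$p.value
})

# Adjust for multiple comparisons (Holm, as psych::corr.test does by default)
p_adj <- p.adjust(p_raw, method = "holm")
```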

Careful, correlation coefficients have some serious shortcomings:

The correlation coefficient measures linear agreement–whether the measurements go up-and-down together. Certainly, we want the measures to go up-and-down together, but the correlation coefficient itself is deficient in at least three ways as a measure of agreement. (http://www.jerrydallal.com/LHSP/compare.htm)

  • The correlation coefficient can be close to 1 (or equal to 1!) even when there is considerable bias between the two methods. For example, if one method gives measurements that are always 10 units higher than the other method, the correlation will be 1 exactly, but the measurements will always be 10 units apart.
  • The magnitude of the correlation coefficient is affected by the range of subjects/units studied. The correlation coefficient can be made smaller by measuring samples that are similar to each other and larger by measuring samples that are very different from each other. The magnitude of the correlation says nothing about the magnitude of the differences between the paired measurements which, when you get right down to it, is all that really matters.
  • The usual significance test involving a correlation coefficient (whether the population value is 0) is irrelevant to the comparability problem. What is important is not merely that the correlation coefficient be different from 0. Rather, it should be close to (ideally, equal to) 1!

A good summary of the methods and their shortcomings can be found here: https://www.statisticssolutions.com/correlation-Pearson-Kendall-spearman/

### Bland-Altman Plots

The Bland-Altman method, well established for method comparison in clinical trials, can be used here as well to compare the means and differences between datasets. If the points lie within the two SD-lines for the differences, the datasets can be assumed to agree well with each other.
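A minimal base-R sketch of such a plot on synthetic data; the limits of agreement are drawn at the mean difference ± 1.96 SD, the usual Bland-Altman convention:

```r
set.seed(42)
meas  <- rlnorm(100, meanlog = 2)       # synthetic measurements
model <- meas * rlnorm(100, sd = 0.4)   # synthetic model values

avg   <- (log(meas) + log(model)) / 2   # x-axis: mean of both methods
diffs <- log(model) - log(meas)         # y-axis: difference between methods

bias  <- mean(diffs)
upper <- bias + 1.96 * sd(diffs)        # upper limit of agreement
lower <- bias - 1.96 * sd(diffs)        # lower limit of agreement

plot(avg, diffs, xlab = "Mean of methods (log)", ylab = "Difference (log)")
abline(h = c(lower, bias, upper), lty = c(2, 1, 2))
```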

### Density Plots

These plots allow us to observe the error for different concentration categories.

## Statistical Assessment

First, various metrics are compared where the pollen concentrations are considered a continuous numerical variable.

**Poaceae**

| R2        | Bias       | SD       | MAE      | RMSE     | MSLE      | RMSLE     | type        |
|-----------|------------|----------|----------|----------|-----------|-----------|-------------|
| 0.6415216 | -15.192597 | 46.72033 | 18.67645 | 49.12441 | 0.4208254 | 0.6487106 | baseline    |
| 0.6962629 | -4.826044  | 43.04682 | 15.81800 | 43.31262 | 0.1578102 | 0.3972533 | calibration |
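The metrics in the table can be written down compactly; a sketch (the 0.001 offset from above guards the logarithms; the input values are illustrative):

```r
eps <- 0.001
metrics <- function(obs, mod) {
  obs <- obs + eps
  mod <- mod + eps
  e   <- mod - obs             # raw error
  le  <- log(mod) - log(obs)   # log error
  c(R2    = cor(obs, mod)^2,
    Bias  = mean(e),
    SD    = sd(e),
    MAE   = mean(abs(e)),
    RMSE  = sqrt(mean(e^2)),
    MSLE  = mean(le^2),
    RMSLE = sqrt(mean(le^2)))
}

m <- metrics(obs = c(0, 5, 20, 80), mod = c(1, 4, 25, 60))
```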

Second, the values are converted into buckets based on health impact. The impact classes are defined here: https://service.meteoswiss.ch/confluence/x/1ZG4. Now we can investigate various metrics that are typically used for categorical variables. The Kappa metric was chosen as the most meaningful metric for this analysis; it is explained here: https://towardsdatascience.com/multi-class-metrics-made-simple-the-kappa-score-aka-cohens-kappa-coefficient-bdea137af09c

**Poaceae**

| type        | Accuracy  | Kappa     |
|-------------|-----------|-----------|
| baseline    | 0.3895043 | 0.1613036 |
| calibration | 0.6065008 | 0.4311992 |
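Cohen's Kappa compares the observed agreement with the agreement expected by chance; a base-R sketch on a made-up confusion matrix:

```r
cohens_kappa <- function(cm) {
  n  <- sum(cm)
  po <- sum(diag(cm)) / n                     # observed agreement
  pe <- sum(rowSums(cm) * colSums(cm)) / n^2  # agreement expected by chance
  (po - pe) / (1 - pe)
}

# Rows: predicted class, columns: observed class (illustrative counts)
cm <- matrix(c(40, 10,  5,
                8, 30, 12,
                2,  5, 25), nrow = 3, byrow = TRUE)
k <- cohens_kappa(cm)
```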

To take into account that the health impact levels are ordered, we can treat them as numerical values from 0 (nothing) to 4 (very strong).

**Poaceae**

| MAE_baseline | MAE_calibration |
|--------------|-----------------|
| 0.7161794    | 0.4247322       |
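Encoding the ordered levels as integers, the class-level MAE is simply the mean absolute level difference; a small sketch with made-up labels:

```r
impact_levels <- c("nothing", "weak", "medium", "strong", "verystrong")

# Illustrative predicted and observed impact classes
pred <- factor(c("weak", "nothing", "medium", "strong"), levels = impact_levels)
obs  <- factor(c("weak", "weak",    "strong", "strong"), levels = impact_levels)

# as.integer() maps the ordered levels to 1..5; differences are unaffected
mae_ord <- mean(abs(as.integer(pred) - as.integer(obs)))
```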

The following table could be used in the appendix.

|                      | Reference: Event | Reference: No Event |
|----------------------|------------------|---------------------|
| Predicted: Event     | A                | B                   |
| Predicted: No Event  | C                | D                   |

The formulas used here are:

  • Sensitivity = A / (A + C)
  • Specificity = D / (B + D)
  • Pos Pred Value = Precision = A / (A + B)
  • Neg Pred Value = D / (C + D)
  • Recall = Sensitivity = A / (A + C)
  • F1 = 2 * Precision * Recall / (Precision + Recall)
  • Prevalence = (A + C) / (A + B + C + D)
  • Detection Rate = A / (A + B + C + D)
  • Detection Prevalence = (A + B) / (A + B + C + D)
  • Balanced Accuracy = (Sensitivity + Specificity) / 2
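These one-vs-all quantities follow directly from the four counts; a base-R sketch with illustrative numbers:

```r
A <- 80; B <- 20   # predicted event:    hits / false alarms
C <- 10; D <- 90   # predicted no event: misses / correct rejections

sensitivity <- A / (A + C)                  # also called recall
specificity <- D / (B + D)
precision   <- A / (A + B)                  # positive predictive value
npv         <- D / (C + D)                  # negative predictive value
f1          <- 2 * precision * sensitivity / (precision + sensitivity)
balanced_accuracy <- (sensitivity + specificity) / 2
```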

**Poaceae**

| Sensitivity | Specificity | Pos Pred Value | Neg Pred Value | Precision | Recall | F1    | Prevalence | Detection Rate | Detection Prevalence | Balanced Accuracy | class             | type        |
|-------------|-------------|----------------|----------------|-----------|--------|-------|------------|----------------|----------------------|-------------------|-------------------|-------------|
| 0.956       | 0.481       | 0.383          | 0.970          | 0.383     | 0.956  | 0.547 | 0.252      | 0.241          | 0.629                | 0.718             | Class: nothing    | baseline    |
| 0.747       | 0.854       | 0.633          | 0.909          | 0.633     | 0.747  | 0.686 | 0.252      | 0.188          | 0.298                | 0.801             | Class: nothing    | calibration |
| 0.180       | 0.775       | 0.395          | 0.536          | 0.395     | 0.180  | 0.247 | 0.450      | 0.081          | 0.205                | 0.477             | Class: weak       | baseline    |
| 0.666       | 0.724       | 0.664          | 0.726          | 0.664     | 0.666  | 0.665 | 0.450      | 0.299          | 0.451                | 0.695             | Class: weak       | calibration |
| 0.168       | 0.932       | 0.289          | 0.871          | 0.289     | 0.168  | 0.212 | 0.142      | 0.024          | 0.082                | 0.550             | Class: medium     | baseline    |
| 0.363       | 0.913       | 0.409          | 0.897          | 0.409     | 0.363  | 0.385 | 0.142      | 0.052          | 0.126                | 0.638             | Class: medium     | calibration |
| 0.334       | 0.956       | 0.507          | 0.914          | 0.507     | 0.334  | 0.403 | 0.119      | 0.040          | 0.079                | 0.645             | Class: strong     | baseline    |
| 0.424       | 0.951       | 0.541          | 0.924          | 0.541     | 0.424  | 0.475 | 0.119      | 0.051          | 0.094                | 0.688             | Class: strong     | calibration |
| 0.104       | 0.999       | 0.808          | 0.967          | 0.808     | 0.104  | 0.185 | 0.036      | 0.004          | 0.005                | 0.552             | Class: verystrong | baseline    |
| 0.448       | 0.984       | 0.517          | 0.979          | 0.517     | 0.448  | 0.480 | 0.036      | 0.016          | 0.032                | 0.716             | Class: verystrong | calibration |

## Robust Contrasts with Confidence Intervals

The R package nparcomp implements a broad range of rank-based nonparametric methods for multiple comparisons (see https://www.researchgate.net/publication/282206980_nparcomp_An_R_Software_Package_for_Nonparametric_Multiple_Comparisons_and_Simultaneous_Confidence_Intervals). The single-step procedures provide local test decisions in terms of multiplicity-adjusted p-values and simultaneous confidence intervals. The null hypothesis H0: p = 1/2 is significantly rejected at the 5% level of significance for many pairwise comparisons. Whenever the p-value is not smaller than 5%, the confidence interval contains 0.5 and the effect of the factor is not statistically meaningful. The estimator can also be interpreted as a proxy for the relative difference in median between model and measurements: if the estimator is > 0.5, the model tends to produce larger values than the measurements.

## 
##  #------Nonparametric Multiple Comparisons for relative contrast effects-----# 
##  
##  - Alternative Hypothesis:  True relative contrast effect p is not equal to 1/2 
##  - Type of Contrast : Dunnett 
##  - Confidence level: 95 % 
##  - Method = Logit - Transformation 
##  - Estimation Method: Pairwise rankings 
##  
##  #---------------------------Interpretation----------------------------------# 
##  p(a,b) > 1/2 : b tends to be larger than a 
##  #---------------------------------------------------------------------------# 
## 
Robust Contrasts and Confidence Intervals for Poaceae Pollen Measurements

| Taxon   | Comparison                    | Estimator | Lower | Upper | pValue              |
|---------|-------------------------------|-----------|-------|-------|---------------------|
| Poaceae | p(measurement, baseline)      | 0.365     | 0.353 | 0.376 | 0                   |
| Poaceae | p(measurement, calibration)   | 0.481     | 0.469 | 0.493 | 0.00101794050235549 |
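The relative effect p(a, b) estimates the probability that a random value from b exceeds a random value from a, with ties counting half; a base-R sketch of this estimator on synthetic data:

```r
relative_effect <- function(a, b) {
  # Estimate P(a < b) + 0.5 * P(a == b) over all pairs
  mean(outer(a, b, function(x, y) (x < y) + 0.5 * (x == y)))
}

set.seed(7)
measurement <- rlnorm(200)
model       <- rlnorm(200, meanlog = -0.3)  # shifted downwards

p_hat <- relative_effect(measurement, model)
# p_hat below 0.5 indicates the model tends to produce smaller values
```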

Reviewer 1 requested an additional graph displaying the changes of the tune factor over the course of a season. The following chunk loads the data and saves a line-plot:

Final version of the model:

Version 2 below had the following settings:

Version 4 has the same setup as the final version except:

Version 6 had the same setup as the final version, but the changes to the tuning factor were calculated based on the past 24 hours only:

Reviewer 2 requested a similar graph faceted by station but for the concentrations.